Background

Based on the following steps, we find that there are 421,570 instances in this dataset, covering the sales results of 45 retail stores, with roughly 99 departments per store.
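These counts can be reproduced with pandas' `nunique()`. A minimal sketch on a toy frame (the values here are hypothetical stand-ins for the merged training data):

```python
import pandas as pd

# Toy frame standing in for the merged training data (hypothetical values).
train_toy = pd.DataFrame({
    "Store": [1, 1, 2, 2, 3],
    "Dept":  [1, 2, 1, 3, 2],
})

n_stores = train_toy["Store"].nunique()  # number of distinct stores
n_depts = train_toy["Dept"].nunique()    # number of distinct departments
print(n_stores, n_depts)                 # 3 3
```

On the real merged frame, `train["Store"].nunique()` returns 45.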

Missing Values

Before we get into the analysis, we first deal with the missing values. As we observe below, the missing values are concentrated in five features: MarkDown1, MarkDown2, MarkDown3, MarkDown4, and MarkDown5; no missing values are found in any other feature. For this reason, we dig deeper into these features to check whether the missing values follow a pattern consistent enough to let us replace them.

The following shows the proportion of missing values for every store. Based on Table 1, the proportion of missing values among these five features ranges from 63% to 90%. Given such high volumes, we dig further to check whether the values are missing at random. Table 2 shows that the date range without missing values runs from 2011-11-11 to 2012-10-26, while Table 3 shows that the range with missing values runs from 2010-02-05 to 2011-11-04. From this we can see that the large volume of missing values is NOT missing at random. Considering both the volume and the missing pattern, I propose dropping the five features (i.e., columns) instead of records (i.e., rows), because keeping as many records as possible is important for the time-series analysis.
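The proposed column drop preserves every week of the series. A minimal sketch on a toy frame with the same column layout (hypothetical values):

```python
import numpy as np
import pandas as pd

markdown_cols = ["MarkDown1", "MarkDown2", "MarkDown3", "MarkDown4", "MarkDown5"]

# Toy frame with the same column layout (hypothetical values).
df = pd.DataFrame({
    "Weekly_Sales": [100.0, 200.0],
    **{c: [np.nan, 1.0] for c in markdown_cols},
})

# Dropping the sparse columns keeps every row (week) for the time series.
df = df.drop(columns=markdown_cols)
print(df.shape)  # (2, 1)
```

Applied to the real frame, `train.drop(columns=markdown_cols)` would keep all 421,570 rows while removing the five sparse features.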

Categorical Features

This dataset contains two categorical features, "IsHoliday" and "Store_Type". To proceed with the time-series analysis, these categorical attributes are dummied into indicator variables, as shown in Table 4. The binary "IsHoliday" feature is split into two columns (False, True) and "Store_Type" into three columns (A, B, C).

Outliers

Next, we check for outliers in the continuous variables across every store.

Weekly_Sales: Due to geographical or holiday effects, the sales of each store are expected to vary. Referring to Figure 1, the weekly sales not only vary quite a bit, but a considerably large number of data points fall outside the typical range of sales. As mentioned, this variation may be driven by a mix of factors, so I leave those "outliers" as is for now.
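Even though we keep these points, the amount flagged by the box plots can be quantified with Tukey's 1.5 × IQR rule, which is what the plotly boxes use by default. A minimal sketch on toy weekly sales for one store (hypothetical values, one extreme week):

```python
import pandas as pd

# Toy weekly sales (in thousands) for one store; one extreme week.
sales = pd.Series([10.0, 12.0, 11.0, 13.0, 12.5, 95.0])

q1, q3 = sales.quantile([0.25, 0.75])
iqr = q3 - q1
# Tukey's rule: points beyond 1.5*IQR from the quartiles are flagged.
outliers = sales[(sales < q1 - 1.5 * iqr) | (sales > q3 + 1.5 * iqr)]
print(len(outliers))  # 1
```

Grouping by store (`train.groupby("Store")["Weekly_Sales"]`) and applying the same rule would give a per-store outlier count to compare against Figure 1.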

Temperature: Referring to Figure 2, the "Temperature" variable looks very consistent across the stores, and no values fall significantly outside their typical range.

Fuel Price: Referring to Figure 3, the "Fuel_Price" feature is even more consistent than "Temperature", ranging between 2.5 and 4.5 across the stores, so I believe there are no outliers in this feature. This makes sense: fuel has a market price set nationwide, so not much variation is expected within each geographical area.

Consumer Price Index (CPI): According to the U.S. Bureau of Labor Statistics, the CPI is "a measure of the average change over time in the prices paid by urban consumers for a market basket of consumer goods and services". Referring to Figure 4, the CPI ranges from 120 to 230 and differs considerably among the stores, but no outliers appear in this dataset. This feature gives a relative indication of consumer purchasing power across stores. As a working hypothesis, an area with a higher CPI should have more purchasing power and therefore generate more sales for the store, and vice versa. Further analysis will be conducted to measure this aspect as the analysis proceeds.

Unemployment: The unemployment rate is another index of consumer purchasing power. Referring to Figure 5, most values fall between 6 and 8, without outliers among the stores. As with "CPI", further analysis will be conducted to measure how this feature influences the weekly sales of each store.
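The planned check for both hypotheses amounts to correlating CPI and Unemployment against Weekly_Sales. A minimal sketch on toy store-level aggregates (hypothetical values chosen only to illustrate the direction of each relationship):

```python
import pandas as pd

# Toy store-level aggregates (hypothetical values).
stores = pd.DataFrame({
    "CPI":          [130.0, 180.0, 220.0],
    "Unemployment": [8.5, 7.0, 6.0],
    "Weekly_Sales": [900.0, 1400.0, 2100.0],
})

# Pearson correlation of each index against sales across stores.
print(stores.corr()["Weekly_Sales"])
```

Under the hypothesis above, CPI would correlate positively with sales and Unemployment negatively; the real per-store aggregates may or may not show this pattern.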

In [4]:
import pandas as pd
import numpy as np

trainDf = pd.read_csv("./data/train.csv")
feaDf = pd.read_csv("./data/features.csv").drop(["IsHoliday"], axis=1)
storesDf = pd.read_csv("./data/stores.csv")


df1 = pd.merge(trainDf, feaDf, on = ["Store", "Date"])
#print(df1.shape)
train = pd.merge(df1, storesDf, on = "Store")
print("Dataset size:",train.shape)

#Rename column names
train.columns = ['Store', 'Dept', 'Date', 'Weekly_Sales', 'IsHoliday', 'Temperature',
       'Fuel_Price', 'MarkDown1', 'MarkDown2', 'MarkDown3', 'MarkDown4',
       'MarkDown5', 'CPI', 'Unemployment', 'Store_Type', 'Size']

display(train.head(3))
Dataset size: (421570, 16)
Store Dept Date Weekly_Sales IsHoliday Temperature Fuel_Price MarkDown1 MarkDown2 MarkDown3 MarkDown4 MarkDown5 CPI Unemployment Store_Type Size
0 1 1 2010-02-05 24924.50 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
1 1 2 2010-02-05 50605.27 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
2 1 3 2010-02-05 13740.12 False 42.31 2.572 NaN NaN NaN NaN NaN 211.096358 8.106 A 151315
In [5]:
cntLst = []
print("Table 1")
print('''Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''')
print()
for store in train.Store.unique():
    for var in ["MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"]:
        cntLst.append(train[var][train.Store == store].isnull().sum())
    print("Store #%d"%(store), end="")
    print(" with %d instance has missing values"%(train[train.Store == store].shape[0]), end=" ")
    print(cntLst, end=" ")
    print("in percent", end=" ")
    print(np.round(np.array(cntLst) / train[train.Store == store].shape[0], 2)*100, end="\n")
    cntLst=[]
Table 1
Proportion of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store #1 with 10244 instance has missing values [6587, 7229, 6656, 6587, 6587] in percent [ 64.  71.  65.  64.  64.]
Store #2 with 10238 instance has missing values [6575, 7218, 6646, 6575, 6575] in percent [ 64.  71.  65.  64.  64.]
Store #3 with 9036 instance has missing values [5791, 6554, 6230, 5922, 5791] in percent [ 64.  73.  69.  66.  64.]
Store #4 with 10272 instance has missing values [6596, 7171, 6739, 6668, 6596] in percent [ 64.  70.  66.  65.  64.]
Store #5 with 8999 instance has missing values [5776, 6660, 6279, 5906, 5776] in percent [ 64.  74.  70.  66.  64.]
Store #6 with 10211 instance has missing values [6559, 7057, 6702, 6559, 6559] in percent [ 64.  69.  66.  64.  64.]
Store #7 with 9762 instance has missing values [6256, 7209, 6396, 6256, 6256] in percent [ 64.  74.  66.  64.  64.]
Store #8 with 9895 instance has missing values [6360, 6982, 6496, 6429, 6360] in percent [ 64.  71.  66.  65.  64.]
Store #9 with 8867 instance has missing values [5657, 6791, 6225, 5783, 5657] in percent [ 64.  77.  70.  65.  64.]
Store #10 with 10315 instance has missing values [6669, 7736, 6957, 6669, 6669] in percent [ 65.  75.  67.  65.  65.]
Store #11 with 10062 instance has missing values [6438, 7003, 6508, 6580, 6438] in percent [ 64.  70.  65.  65.  64.]
Store #12 with 9705 instance has missing values [6211, 6961, 6417, 6211, 6211] in percent [ 64.  72.  66.  64.  64.]
Store #13 with 10474 instance has missing values [6745, 7330, 6816, 6745, 6745] in percent [ 64.  70.  65.  64.  64.]
Store #14 with 10040 instance has missing values [6451, 7157, 6520, 6451, 6451] in percent [ 64.  71.  65.  64.  64.]
Store #15 with 9901 instance has missing values [6359, 7055, 6705, 6359, 6359] in percent [ 64.  71.  68.  64.  64.]
Store #16 with 9443 instance has missing values [5961, 6918, 6638, 5961, 5961] in percent [ 63.  73.  70.  63.  63.]
Store #17 with 9864 instance has missing values [6304, 7285, 6864, 6304, 6304] in percent [ 64.  74.  70.  64.  64.]
Store #18 with 9859 instance has missing values [6295, 6921, 6506, 6295, 6295] in percent [ 64.  70.  66.  64.  64.]
Store #19 with 10148 instance has missing values [6526, 7095, 6738, 6526, 6526] in percent [ 64.  70.  66.  64.  64.]
Store #20 with 10214 instance has missing values [6561, 6992, 6631, 6561, 6561] in percent [ 64.  68.  65.  64.  64.]
Store #21 with 9582 instance has missing values [6133, 6808, 6402, 6133, 6133] in percent [ 64.  71.  67.  64.  64.]
Store #22 with 9688 instance has missing values [6212, 7037, 6552, 6212, 6212] in percent [ 64.  73.  68.  64.  64.]
Store #23 with 10050 instance has missing values [6464, 7169, 6743, 6464, 6464] in percent [ 64.  71.  67.  64.  64.]
Store #24 with 10228 instance has missing values [6570, 7072, 6641, 6570, 6570] in percent [ 64.  69.  65.  64.  64.]
Store #25 with 9804 instance has missing values [6296, 6915, 6569, 6296, 6296] in percent [ 64.  71.  67.  64.  64.]
Store #26 with 9854 instance has missing values [6310, 7002, 6446, 6382, 6310] in percent [ 64.  71.  65.  65.  64.]
Store #27 with 10225 instance has missing values [6569, 7143, 6640, 6569, 6569] in percent [ 64.  70.  65.  64.  64.]
Store #28 with 10113 instance has missing values [6494, 7058, 6704, 6494, 6494] in percent [ 64.  70.  66.  64.  64.]
Store #29 with 9455 instance has missing values [6088, 6946, 6352, 6088, 6088] in percent [ 64.  73.  67.  64.  64.]
Store #30 with 7156 instance has missing values [4693, 6338, 5287, 6241, 4591] in percent [ 66.  89.  74.  87.  64.]
Store #31 with 10142 instance has missing values [6528, 7233, 6598, 6528, 6528] in percent [ 64.  71.  65.  64.  64.]
Store #32 with 10202 instance has missing values [6573, 7144, 6645, 6573, 6573] in percent [ 64.  70.  65.  64.  64.]
Store #33 with 6487 instance has missing values [4194, 5552, 5547, 6156, 4099] in percent [ 65.  86.  86.  95.  63.]
Store #34 with 10224 instance has missing values [6616, 7246, 6828, 6616, 6616] in percent [ 65.  71.  67.  65.  65.]
Store #35 with 9528 instance has missing values [6112, 7582, 6315, 6112, 6112] in percent [ 64.  80.  66.  64.  64.]
Store #36 with 6222 instance has missing values [4058, 5382, 5380, 5739, 3968] in percent [ 65.  86.  86.  92.  64.]
Store #37 with 7206 instance has missing values [4750, 6266, 5112, 6944, 4542] in percent [ 66.  87.  71.  96.  63.]
Store #38 with 7362 instance has missing values [4658, 6414, 5241, 6235, 4658] in percent [ 63.  87.  71.  85.  63.]
Store #39 with 9878 instance has missing values [6356, 7046, 6492, 6356, 6356] in percent [ 64.  71.  66.  64.  64.]
Store #40 with 10017 instance has missing values [6418, 7268, 6699, 6418, 6418] in percent [ 64.  73.  67.  64.  64.]
Store #41 with 10088 instance has missing values [6439, 7083, 6509, 6439, 6439] in percent [ 64.  70.  65.  64.  64.]
Store #42 with 6953 instance has missing values [4507, 6044, 4862, 6443, 4352] in percent [ 65.  87.  70.  93.  63.]
Store #43 with 6751 instance has missing values [4341, 6045, 4715, 6464, 4341] in percent [ 64.  90.  70.  96.  64.]
Store #44 with 7169 instance has missing values [4649, 6414, 5213, 6600, 4548] in percent [ 65.  89.  73.  92.  63.]
Store #45 with 9637 instance has missing values [6184, 6791, 6318, 6184, 6184] in percent [ 64.  70.  66.  64.  64.]
In [6]:
print("Table 2")
print('''Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
    # Dates where MarkDown1 is present for this store (avoids chained indexing)
    arr = sorted(pd.to_datetime(train.loc[train.MarkDown1.notnull() & (train.Store == store), "Date"].unique()))
    print("Store %d %s - %s"%(store,min(arr),max(arr)))
Table 2
Date range WITHOUT Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 2 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 3 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 4 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 5 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 6 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 7 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 8 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 9 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 10 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 11 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 12 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 13 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 14 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 15 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 16 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 17 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 18 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 19 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 20 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 21 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 22 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 23 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 24 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 25 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 26 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 27 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 28 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 29 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 30 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 31 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 32 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 33 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 34 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 35 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 36 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 37 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 38 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 39 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 40 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 41 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 42 2011-11-11 00:00:00 - 2012-10-19 00:00:00
Store 43 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 44 2011-11-11 00:00:00 - 2012-10-26 00:00:00
Store 45 2011-11-11 00:00:00 - 2012-10-26 00:00:00
In [7]:
print("Table 3")
print('''Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"''', end="\n")
print()
for store in train.Store.unique():
    # Dates where MarkDown1 is missing for this store (avoids chained indexing)
    arr = sorted(pd.to_datetime(train.loc[train.MarkDown1.isnull() & (train.Store == store), "Date"].unique()))
    print("Store %d %s - %s"%(store,min(arr),max(arr)))
Table 3
Date range of Missing values for features:"MarkDown1", "MarkDown2","MarkDown3","MarkDown4","MarkDown5"

Store 1 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 2 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 3 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 4 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 5 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 6 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 7 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 8 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 9 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 10 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 11 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 12 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 13 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 14 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 15 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 16 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 17 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 18 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 19 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 20 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 21 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 22 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 23 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 24 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 25 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 26 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 27 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 28 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 29 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 30 2010-02-05 00:00:00 - 2012-10-05 00:00:00
Store 31 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 32 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 33 2010-02-05 00:00:00 - 2012-06-29 00:00:00
Store 34 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 35 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 36 2010-02-05 00:00:00 - 2011-12-09 00:00:00
Store 37 2010-02-05 00:00:00 - 2012-08-31 00:00:00
Store 38 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 39 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 40 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 41 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 42 2010-02-05 00:00:00 - 2012-10-26 00:00:00
Store 43 2010-02-05 00:00:00 - 2011-11-04 00:00:00
Store 44 2010-02-05 00:00:00 - 2011-12-09 00:00:00
Store 45 2010-02-05 00:00:00 - 2011-11-04 00:00:00
In [8]:
print("Table 4")
print('''Split the feature "Store_Type" into A,B,C and "IsHoliday" into False,True ''')
#display(pd.get_dummies(train[["IsHoliday"]]))
pd.concat((pd.get_dummies(train["Store_Type"]),pd.get_dummies(train["IsHoliday"])), axis=1).head(5)
Table 4
Split the feature "Store_Type" into A,B,C and "IsHoliday" into False,True 
Out[8]:
A B C False True
0 1 0 0 1 0
1 1 0 0 1 0
2 1 0 0 1 0
3 1 0 0 1 0
4 1 0 0 1 0
In [145]:
import plotly.graph_objs as go
from plotly.offline import init_notebook_mode, iplot

init_notebook_mode(connected=True)

namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Weekly_Sales[train.Store == int(name)]/1000,2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Sales Values by each store (Figure 1)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Weekly Sales (in thousands)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-10,200], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig, show_link=True)
In [139]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Temperature[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Temperature Values by each store (Figure 2)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Temperature(F)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[-5,120], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [138]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Fuel_Price[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Fuel Price Values by each store (Figure 3)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Fuel Price($)",
                 titlefont=dict(size=18),
                 zeroline=True, range=[2,5], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [137]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.CPI[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of CPI Values by each store (Figure 4)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="CPI",
                 titlefont=dict(size=18),
                 zeroline=True, range=[100,250], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)
In [148]:
namesLst=list(map(str,train.Store.unique()))

aa=[]
salesLst=[]
for name in namesLst:
    aa=round(train.Unemployment[train.Store == int(name)],2)
    salesLst = salesLst + [aa]

traces=[]

for name, sales in zip (namesLst, salesLst):
    traces.append(go.Box(
        x=sales,
        name=name
    ))

layout = go.Layout(
    title = 'Range of Unemployment Rate by each store (Figure 5)',
    #yaxis =dict(autorange=True, showgrid=True,zeroline=True, autotick=False),
    
    autosize=False,
    width=700,
    height=1000,
    yaxis =dict(title = "Store Number", 
                exponentformat='e',
                showexponent='all',
                titlefont=dict(size=18),
                tick0=5,ticks="outside", 
                dtick=1, 
                tickwidth=2, 
                showgrid=True),
    xaxis = dict(title="Unemployment Rate",
                 titlefont=dict(size=18),
                 zeroline=True, range=[0,20], showgrid=True),
    margin = dict(l=60,r=30, b=80, t=40),
    showlegend=False
)
    
fig = go.Figure(data=traces, layout=layout)
iplot(fig)